DCU-UVT: Word-Level Language Classification with Code-Mixed Data

Authors

Utsab Barman

Joachim Wagner

Grzegorz Chrupala

Jennifer Foster

Abstract

This paper describes the DCU-UVT team’s participation in the Language Identification in Code-Switched Data shared task in the Workshop on Computational Approaches to Code Switching. Word-level classification experiments were carried out using a simple dictionary-based method, linear-kernel support vector machines (SVMs) with and without contextual clues, and a k-nearest neighbour approach. Based on these experiments, we select our SVM-based system with contextual clues as our final system and present results for the Nepali-English and Spanish-English datasets.
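To make the idea of a simple dictionary-based word-level classifier concrete, here is a minimal Python sketch. It is not the DCU-UVT system: the abstract does not say how the dictionaries were built, so the sketch assumes a dictionary derived from labelled training tokens, with unseen words falling back to the most frequent training label. The function names `train_dictionary` and `classify`, the label strings, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def train_dictionary(tagged_tokens):
    """Count how often each lowercased word carries each language label."""
    counts = defaultdict(Counter)
    label_totals = Counter()
    for word, label in tagged_tokens:
        counts[word.lower()][label] += 1
        label_totals[label] += 1
    # Assumed fallback for unseen words: the most frequent training label.
    default_label = label_totals.most_common(1)[0][0]
    return counts, default_label


def classify(words, counts, default_label):
    """Give each word its most frequent training label, else the default."""
    labels = []
    for word in words:
        seen = counts.get(word.lower())
        labels.append(seen.most_common(1)[0][0] if seen else default_label)
    return labels


# Toy, invented training data (not taken from the shared-task datasets).
train = [
    ("the", "en"), ("house", "en"), ("is", "en"), ("the", "en"),
    ("la", "es"), ("casa", "es"), ("es", "es"),
]
counts, default_label = train_dictionary(train)
predicted = classify(["La", "casa", "is", "grande"], counts, default_label)
```

A lookup like this ignores the neighbouring words entirely, which is one plausible reason a classifier with contextual clues, such as the SVM variant the abstract selects as the final system, can do better on ambiguous tokens.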


Similar resources

DCU-Symantec at the WMT 2013 Quality Estimation Shared Task

We describe the two systems submitted by the DCU-Symantec team to Task 1.1 of the WMT 2013 Shared Task on Quality Estimation for Machine Translation. Task 1.1 involves estimating post-editing effort for English-Spanish translation pairs in the news domain. The two systems use a wide variety of features, of which the most effective are the word-alignment, n-gram frequency, language model, POS-tag...


UvT-WSD1: A Cross-Lingual Word Sense Disambiguation System

This paper describes the Cross-Lingual Word Sense Disambiguation system UvT-WSD1, developed at Tilburg University, for participation in two SemEval-2 tasks: the Cross-Lingual Word Sense Disambiguation task and the Cross-Lingual Lexical Substitution task. The UvT-WSD1 system makes use of k-nearest neighbour classifiers, in the form of single-word experts for each target word to be disambiguated. ...


"ye word kis lang ka hai bhai?" Testing the Limits of Word level Language Identification

Language identification is a necessary prerequisite for processing any user-generated text where the language is unknown. It becomes even more challenging when the text is code-mixed, i.e., two or more languages are used within the same text. Such data is commonly seen in social media, where further challenges might arise due to contractions and transliterations. The existing language identifi...


DCU-Lingo24 Participation in WMT 2014 Hindi-English Translation task

This paper describes the DCU-Lingo24 submission to WMT 2014 for the Hindi-English translation task. We exploit miscellaneous methods in our system, including: Context-Informed PB-SMT, OOV Word Conversion (OWC), Multi-Alignment Combination (MAC), Operation Sequence Model (OSM), Stemming Align and Normal Phrase Extraction (SANPE), and Language Model Interpolation (LMI). We also describe various pre...


Identifying Languages at the Word Level in Code-Mixed Indian Social Media Text

Language identification at the document level has been considered an almost solved problem in some application areas, but language detectors fail in the social media context due to phenomena such as utterance-internal code-switching, lexical borrowings, and phonetic typing, all implying that language identification in social media has to be carried out at the word level. The paper reports a stu...



Publication date: 2014
